⚡️ Speed up method `DocumentUrl._infer_media_type` by 12% in PR #35 (`trigger-cf-workflow`) #36

codeflash-ai · 2025-07-25T03:11:30Z

⚡️ This pull request contains optimizations for PR #35

If you approve this dependent PR, these changes will be merged into the original PR branch trigger-cf-workflow.

This PR will be automatically closed if the original PR is merged.

📄 12% (0.12x) speedup for `DocumentUrl._infer_media_type` in `pydantic_ai_slim/pydantic_ai/messages.py`

⏱️ Runtime : 23.8 milliseconds → 21.3 milliseconds (best of 30 runs)
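
The headline figure is consistent with measuring the speedup as time saved relative to the optimized runtime. That formula is an assumption inferred from the numbers reported here, not a documented definition, and `relative_speedup` is a hypothetical helper used only for this check:

```python
def relative_speedup(original_ms: float, optimized_ms: float) -> float:
    """Time saved as a fraction of the optimized runtime (assumed definition)."""
    return (original_ms - optimized_ms) / optimized_ms


# Runtimes reported above: 23.8 ms before, 21.3 ms after.
speedup = relative_speedup(23.8, 21.3)
print(f"{speedup:.2f}x ({speedup:.0%})")  # 0.12x (12%)
```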

📝 Explanation and details

The optimized version makes the following changes:

- Caches the result of `guess_type` per unique URL using `functools.lru_cache`, which avoids recomputing the MIME type when the same URL is seen again (most noticeable under many repeated calls).
- Does not repeat the `dataclass` and `repr` decorators on the subclass where the parent `FileUrl` already applies them, to keep runtime behavior consistent.
- Removes imports that are unused in this file, to reduce module loading time.
- Preserves all functionality and the original function signatures.
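
A minimal sketch of the caching pattern described above. It is not the code from this PR: `DocumentUrlSketch` is a simplified, hypothetical stand-in for `DocumentUrl` (no `FileUrl` base, no dataclass machinery), and `maxsize=1024` is an assumed value, since the description does not state the one used.

```python
from functools import lru_cache
from mimetypes import guess_type
from typing import Optional


class DocumentUrlSketch:
    """Simplified stand-in for DocumentUrl; hypothetical, for illustration only."""

    def __init__(self, url: str) -> None:
        self.url = url

    @staticmethod
    @lru_cache(maxsize=1024)  # assumed size; the PR text does not specify it
    def _guess_type_cached(url: str) -> Optional[str]:
        # guess_type returns (type, encoding); only the type is cached.
        # A None result is cached too, so repeated unknown URLs also skip the lookup.
        return guess_type(url)[0]

    def _infer_media_type(self) -> str:
        type_ = self._guess_type_cached(self.url)
        if type_ is None:
            raise ValueError(f'Unknown document file extension: {self.url}')
        return type_


doc = DocumentUrlSketch("https://example.com/file.pdf")
print(doc._infer_media_type())  # first call computes the type
print(doc._infer_media_type())  # second call is served from the cache
```

Because the cache lives on the function rather than on an instance, two different objects with the same URL share one cached entry.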

Notes:

- The `_guess_type_cached` helper is a `staticmethod`, so a single cache of `guess_type` results is shared across all instances.
- `lru_cache` only helps when URLs repeat. If many distinct URLs recur and evictions become a problem, `maxsize=None` removes the size limit, at the cost of unbounded memory growth; if every URL is unique, no cache size will help.
- The optimization mainly benefits use cases where the same URL has its media type inferred more than once.
- The `dataclass` and `repr` decorators are *not required* here because `FileUrl` already establishes the base data model and behaviors.
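
Whether the cache pays off depends on how often URLs repeat. The sketch below uses the `cache_info()` method that `functools.lru_cache` provides to compare a repeated-URL workload with an all-unique one. `cached_guess` is a hypothetical helper, and the URLs and `maxsize=128` are made up for illustration.

```python
from functools import lru_cache
from mimetypes import guess_type


@lru_cache(maxsize=128)
def cached_guess(url: str):
    """Hypothetical cached wrapper around mimetypes.guess_type."""
    return guess_type(url)[0]


# Repeated URL: the first call misses, the remaining 99 hit the cache.
for _ in range(100):
    cached_guess("https://example.com/report.pdf")
repeated = cached_guess.cache_info()

cached_guess.cache_clear()  # also resets the hit/miss counters

# All-unique URLs: every call misses, so the cache adds only overhead.
for i in range(100):
    cached_guess(f"https://example.com/file_{i}.pdf")
unique = cached_guess.cache_info()

print(repeated.hits, repeated.misses)  # 99 1
print(unique.hits, unique.misses)      # 0 100
```

This matches the pattern in the per-test timings, where tests that infer the type once per distinct URL run slightly slower with the cache in place.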

✅ Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 7706 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime

from abc import ABC
from dataclasses import dataclass, field
from mimetypes import guess_type
from typing import Any, Literal

# imports
import pytest  # used for our unit tests
from pydantic_ai.messages import DocumentUrl


@dataclass(init=False, repr=False)
class FileUrl(ABC):
    """Abstract base class for any URL-based file."""

    url: str
    force_download: bool = False
    vendor_metadata: dict[str, Any] | None = None
    _media_type: str | None = field(init=False, repr=False)

    def __init__(
        self,
        url: str,
        force_download: bool = False,
        vendor_metadata: dict[str, Any] | None = None,
        media_type: str | None = None,
    ) -> None:
        self.url = url
        self.vendor_metadata = vendor_metadata
        self.force_download = force_download
        self._media_type = media_type

    # Omitting __repr__ for test purposes
from pydantic_ai.messages import DocumentUrl

# unit tests

# -------------------------------
# 1. Basic Test Cases
# -------------------------------

@pytest.mark.parametrize(
    "url,expected_mime",
    [
        # Standard PDF file
        ("https://example.com/file.pdf", "application/pdf"),
        # Standard Word docx
        ("https://example.com/file.docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
        # Standard Word doc
        ("https://example.com/file.doc", "application/msword"),
        # Standard plain text
        ("https://example.com/file.txt", "text/plain"),
        # Standard HTML
        ("https://example.com/file.html", "text/html"),
        # Standard JPEG image
        ("https://example.com/file.jpg", "image/jpeg"),
        # Standard PNG image
        ("https://example.com/file.png", "image/png"),
        # Standard CSV
        ("https://example.com/file.csv", "text/csv"),
        # Standard JSON
        ("https://example.com/file.json", "application/json"),
        # Standard ZIP
        ("https://example.com/file.zip", "application/zip"),
    ]
)
def test_infer_media_type_basic(url, expected_mime):
    """Test that _infer_media_type returns correct MIME type for common extensions."""
    d = DocumentUrl(url)
    codeflash_output = d._infer_media_type() # 183μs -> 10.8μs (1596% faster)
    assert codeflash_output == expected_mime

# -------------------------------
# 2. Edge Test Cases
# -------------------------------

@pytest.mark.parametrize(
    "url,expected_mime",
    [
        # Uppercase extension
        ("https://example.com/file.PDF", "application/pdf"),
        # Mixed case extension
        ("https://example.com/file.JpEg", "image/jpeg"),
        # Extension with query string
        ("https://example.com/file.pdf?version=2", "application/pdf"),
        # Extension with fragment
        ("https://example.com/file.pdf#section", "application/pdf"),
        # Filename with spaces
        ("https://example.com/my file.txt", "text/plain"),
        # Filename with multiple dots
        ("https://example.com/archive.tar.gz", "application/x-tar"),
        # Filename with no extension but a dot
        ("https://example.com/file.", None),  # Should raise
        # No extension at all
        ("https://example.com/file", None),   # Should raise
        # Hidden file (starts with .)
        ("https://example.com/.hidden.pdf", "application/pdf"),
        # Extension with unusual but valid characters
        ("https://example.com/file.name.with.many.dots.docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
        # Extension with semicolon in query
        ("https://example.com/file.csv;foo=bar", "text/csv"),
    ]
)
def test_infer_media_type_edge(url, expected_mime):
    """Test edge cases for file extension and URL variants."""
    d = DocumentUrl(url)
    if expected_mime is None:
        # Should raise ValueError for unknown extension
        with pytest.raises(ValueError):
            d._infer_media_type() # 171μs -> 142μs (20.1% faster)
    else:
        codeflash_output = d._infer_media_type()

def test_infer_media_type_empty_url():
    """Test that empty url raises ValueError."""
    d = DocumentUrl("")
    with pytest.raises(ValueError):
        d._infer_media_type() # 14.5μs -> 15.4μs (6.24% slower)

def test_infer_media_type_url_with_only_query():
    """Test that a url with only a query string raises ValueError."""
    d = DocumentUrl("?foo=bar")
    with pytest.raises(ValueError):
        d._infer_media_type() # 15.4μs -> 16.3μs (5.55% slower)

def test_infer_media_type_url_with_only_fragment():
    """Test that a url with only a fragment raises ValueError."""
    d = DocumentUrl("#fragment")
    with pytest.raises(ValueError):
        d._infer_media_type() # 15.3μs -> 16.3μs (5.67% slower)

def test_infer_media_type_url_with_path_but_no_extension():
    """Test that a path with no extension raises ValueError."""
    d = DocumentUrl("https://example.com/path/to/file")
    with pytest.raises(ValueError):
        d._infer_media_type() # 19.2μs -> 19.9μs (3.72% slower)

def test_infer_media_type_weird_extension():
    """Test that an unknown/weird extension raises ValueError."""
    d = DocumentUrl("https://example.com/file.unknownext")
    with pytest.raises(ValueError):
        d._infer_media_type() # 19.6μs -> 2.10μs (833% faster)

def test_infer_media_type_url_with_port():
    """Test that a url with a port is handled correctly."""
    d = DocumentUrl("https://example.com:8080/file.pdf")
    codeflash_output = d._infer_media_type() # 18.9μs -> 19.1μs (1.42% slower)

def test_infer_media_type_url_with_long_path():
    """Test a long path with valid extension."""
    d = DocumentUrl("https://example.com/a/b/c/d/e/f/g/h/i/j/file.txt")
    codeflash_output = d._infer_media_type() # 18.7μs -> 19.5μs (4.36% slower)

# -------------------------------
# 3. Large Scale Test Cases
# -------------------------------

def test_infer_media_type_many_urls_pdf():
    """Test 1000 PDF URLs for scalability and performance."""
    urls = [f"https://example.com/file_{i}.pdf" for i in range(1000)]
    for url in urls:
        d = DocumentUrl(url)
        codeflash_output = d._infer_media_type() # 5.72ms -> 6.07ms (5.85% slower)

def test_infer_media_type_many_urls_mixed():
    """Test 1000 mixed URLs for scalability and correctness."""
    extensions = [
        ("pdf", "application/pdf"),
        ("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
        ("jpg", "image/jpeg"),
        ("csv", "text/csv"),
        ("txt", "text/plain"),
        ("html", "text/html"),
    ]
    urls = []
    expected = []
    for i in range(1000):
        ext, mime = extensions[i % len(extensions)]
        url = f"https://example.com/file_{i}.{ext}"
        urls.append(url)
        expected.append(mime)
    for url, mime in zip(urls, expected):
        d = DocumentUrl(url)
        codeflash_output = d._infer_media_type() # 5.84ms -> 5.24ms (11.5% faster)

def test_infer_media_type_large_batch_with_some_invalid():
    """Test a batch of 1000 URLs with 10% invalid extensions."""
    valid_exts = [
        ("pdf", "application/pdf"),
        ("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
        ("jpg", "image/jpeg"),
    ]
    urls = []
    expected = []
    for i in range(1000):
        if i % 10 == 0:
            # Invalid extension
            urls.append(f"https://example.com/file_{i}.invalidext")
            expected.append(None)
        else:
            ext, mime = valid_exts[i % len(valid_exts)]
            urls.append(f"https://example.com/file_{i}.{ext}")
            expected.append(mime)
    for url, mime in zip(urls, expected):
        d = DocumentUrl(url)
        if mime is None:
            with pytest.raises(ValueError):
                d._infer_media_type()
        else:
            codeflash_output = d._infer_media_type()
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from abc import ABC
from dataclasses import dataclass, field
from mimetypes import guess_type
from typing import Any, Literal

# imports
import pytest  # used for our unit tests
from pydantic_ai.messages import DocumentUrl


# Dummy _utils for repr, since actual pydantic_ai._utils is unavailable
class _utils:
    @staticmethod
    def dataclasses_no_defaults_repr(self):
        return f"<{type(self).__name__} url={self.url!r}>"

@dataclass(init=False, repr=False)
class FileUrl(ABC):
    """Abstract base class for any URL-based file."""

    url: str
    force_download: bool = False
    vendor_metadata: dict[str, Any] | None = None
    _media_type: str | None = field(init=False, repr=False)

    def __init__(
        self,
        url: str,
        force_download: bool = False,
        vendor_metadata: dict[str, Any] | None = None,
        media_type: str | None = None,
    ) -> None:
        self.url = url
        self.vendor_metadata = vendor_metadata
        self.force_download = force_download
        self._media_type = media_type

    __repr__ = _utils.dataclasses_no_defaults_repr
from pydantic_ai.messages import DocumentUrl

# unit tests

# -------------------------
# Basic Test Cases
# -------------------------

def test_pdf_file_url():
    # Test a standard PDF file URL
    doc = DocumentUrl(url="https://example.com/file.pdf")
    codeflash_output = doc._infer_media_type() # 18.1μs -> 19.3μs (5.88% slower)

def test_txt_file_url():
    # Test a standard TXT file URL
    doc = DocumentUrl(url="https://example.com/file.txt")
    codeflash_output = doc._infer_media_type() # 18.4μs -> 19.1μs (3.68% slower)

def test_docx_file_url():
    # Test a DOCX file URL
    doc = DocumentUrl(url="https://example.com/file.docx")
    # mimetypes may not always know about docx, but on most systems it does
    codeflash_output = doc._infer_media_type() # 18.4μs -> 19.0μs (3.58% slower)

def test_html_file_url():
    # Test an HTML file URL
    doc = DocumentUrl(url="https://example.com/index.html")
    codeflash_output = doc._infer_media_type() # 18.3μs -> 19.2μs (4.81% slower)

def test_csv_file_url():
    # Test a CSV file URL
    doc = DocumentUrl(url="https://example.com/data.csv")
    codeflash_output = doc._infer_media_type() # 18.3μs -> 19.0μs (3.79% slower)

# -------------------------
# Edge Test Cases
# -------------------------

def test_uppercase_extension():
    # Test a file URL with an uppercase extension
    doc = DocumentUrl(url="https://example.com/FILE.PDF")
    # mimetypes is case-insensitive for extensions
    codeflash_output = doc._infer_media_type() # 18.1μs -> 19.3μs (6.26% slower)

def test_extension_with_query_params():
    # Test a file URL with query parameters after the extension
    doc = DocumentUrl(url="https://example.com/file.pdf?download=true")
    codeflash_output = doc._infer_media_type() # 19.4μs -> 20.3μs (4.39% slower)

def test_extension_with_fragment():
    # Test a file URL with a fragment after the extension
    doc = DocumentUrl(url="https://example.com/file.pdf#section1")
    codeflash_output = doc._infer_media_type() # 19.1μs -> 20.0μs (4.85% slower)

def test_url_with_multiple_dots():
    # Test a file URL with multiple dots in the filename
    doc = DocumentUrl(url="https://example.com/my.file.v1.pdf")
    codeflash_output = doc._infer_media_type() # 18.2μs -> 19.3μs (5.46% slower)

def test_url_with_no_extension():
    # Test a file URL with no extension
    doc = DocumentUrl(url="https://example.com/file")
    with pytest.raises(ValueError, match="Unknown document file extension: https://example.com/file"):
        doc._infer_media_type() # 18.7μs -> 19.4μs (3.72% slower)

def test_url_with_unknown_extension():
    # Test a file URL with an unknown extension
    doc = DocumentUrl(url="https://example.com/file.unknownext")
    with pytest.raises(ValueError, match="Unknown document file extension: https://example.com/file.unknownext"):
        doc._infer_media_type() # 19.3μs -> 20.6μs (6.75% slower)

def test_url_with_hidden_file():
    # Test a file URL with a hidden file (starts with a dot)
    doc = DocumentUrl(url="https://example.com/.hidden.pdf")
    codeflash_output = doc._infer_media_type() # 18.5μs -> 19.7μs (5.85% slower)

def test_url_with_path_and_extension():
    # Test a file URL with a path and extension
    doc = DocumentUrl(url="https://example.com/path/to/file.csv")
    codeflash_output = doc._infer_media_type() # 18.4μs -> 19.1μs (3.88% slower)

def test_url_with_spaces_encoded():
    # Test a file URL with spaces encoded as %20
    doc = DocumentUrl(url="https://example.com/my%20file.txt")
    codeflash_output = doc._infer_media_type() # 18.0μs -> 18.9μs (4.36% slower)

def test_url_with_plus_in_filename():
    # Test a file URL with plus signs in the filename
    doc = DocumentUrl(url="https://example.com/file+name.pdf")
    codeflash_output = doc._infer_media_type() # 18.4μs -> 18.8μs (2.34% slower)

def test_url_with_semicolon_in_filename():
    # Test a file URL with semicolon in the filename
    doc = DocumentUrl(url="https://example.com/file;v=1.pdf")
    codeflash_output = doc._infer_media_type()

def test_url_with_multiple_query_params():
    # Test a file URL with multiple query parameters
    doc = DocumentUrl(url="https://example.com/file.txt?foo=bar&baz=qux")
    codeflash_output = doc._infer_media_type() # 19.5μs -> 21.0μs (7.24% slower)

def test_url_with_port_number():
    # Test a file URL with a port number
    doc = DocumentUrl(url="https://example.com:8080/file.txt")
    codeflash_output = doc._infer_media_type() # 18.5μs -> 19.4μs (5.00% slower)

def test_url_with_subdomain():
    # Test a file URL with a subdomain
    doc = DocumentUrl(url="https://files.example.com/file.txt")
    codeflash_output = doc._infer_media_type() # 18.4μs -> 19.2μs (3.91% slower)

def test_url_with_long_extension():
    # Test a file URL with a long extension (e.g., .tar.gz)
    doc = DocumentUrl(url="https://example.com/archive.tar.gz")
    # guess_type reports .gz as the encoding; the returned type is that of the inner .tar
    codeflash_output = doc._infer_media_type() # 20.2μs -> 20.9μs (3.73% slower)

def test_url_with_strange_but_known_extension():
    # Test a file URL with a known but uncommon extension
    doc = DocumentUrl(url="https://example.com/file.rtf")
    codeflash_output = doc._infer_media_type() # 18.8μs -> 19.7μs (4.48% slower)

def test_url_with_dot_at_end():
    # Test a file URL with a dot at the end (should not match any extension)
    doc = DocumentUrl(url="https://example.com/file.")
    with pytest.raises(ValueError, match="Unknown document file extension: https://example.com/file."):
        doc._infer_media_type() # 20.0μs -> 21.0μs (4.69% slower)

def test_url_with_double_extension():
    # Test a file URL with a double extension (e.g., .tar.bz2)
    doc = DocumentUrl(url="https://example.com/archive.tar.bz2")
    # guess_type reports .bz2 as the encoding; the returned type is that of the inner .tar
    codeflash_output = doc._infer_media_type() # 19.7μs -> 20.3μs (2.87% slower)

def test_url_with_leading_trailing_spaces():
    # Test a file URL with leading/trailing spaces in the URL
    doc = DocumentUrl(url="  https://example.com/file.txt  ")
    # guess_type strips spaces
    codeflash_output = doc._infer_media_type()

def test_url_with_unicode_characters():
    # Test a file URL with unicode characters in the filename
    doc = DocumentUrl(url="https://example.com/文件.pdf")
    codeflash_output = doc._infer_media_type() # 20.2μs -> 20.8μs (2.98% slower)

def test_url_with_no_scheme():
    # Test a file URL with no scheme (should still work if extension is present)
    doc = DocumentUrl(url="example.com/file.pdf")
    codeflash_output = doc._infer_media_type() # 14.7μs -> 15.7μs (6.82% slower)

# -------------------------
# Large Scale Test Cases
# -------------------------

def test_large_batch_of_known_extensions():
    # Test a large batch of known extensions for scalability
    known_extensions = [
        "pdf", "txt", "doc", "docx", "xls", "xlsx", "ppt", "pptx", "csv", "rtf", "html", "htm",
        "json", "xml", "zip", "gz", "bz2", "tar", "jpg", "jpeg", "png", "gif", "bmp", "svg", "mp3",
        "wav", "mp4", "avi", "mov", "wmv", "flv", "mkv", "webm", "ogg", "m4a", "3gp", "ts", "aac",
        "odt", "ods", "odp", "epub", "mobi", "azw", "djvu", "ps", "tex", "log", "md"
    ]
    # Limit to 100 extensions for performance
    for ext in known_extensions[:100]:
        url = f"https://example.com/file.{ext}"
        doc = DocumentUrl(url=url)
        type_, _ = guess_type(url)
        if type_ is not None:
            codeflash_output = doc._infer_media_type()
        else:
            with pytest.raises(ValueError):
                doc._infer_media_type()

def test_large_batch_of_unknown_extensions():
    # Test a large batch of unknown extensions for scalability
    for i in range(100):
        url = f"https://example.com/file.unknown{i}"
        doc = DocumentUrl(url=url)
        with pytest.raises(ValueError):
            doc._infer_media_type()

def test_large_batch_of_mixed_extensions():
    # Test a large batch of mixed known and unknown extensions
    for i in range(50):
        # Known extension
        url_known = f"https://example.com/file{i}.pdf"
        doc_known = DocumentUrl(url=url_known)
        codeflash_output = doc_known._infer_media_type() # 370μs -> 391μs (5.38% slower)
        # Unknown extension
        url_unknown = f"https://example.com/file{i}.zzz"
        doc_unknown = DocumentUrl(url=url_unknown)
        with pytest.raises(ValueError):
            doc_unknown._infer_media_type()

def test_large_batch_with_query_params_and_fragments():
    # Test a large batch of URLs with query params and fragments
    for i in range(50):
        url = f"https://example.com/file{i}.txt?foo=bar#{i}"
        doc = DocumentUrl(url=url)
        codeflash_output = doc._infer_media_type() # 359μs -> 379μs (5.28% slower)

def test_performance_large_number_of_urls():
    # Test performance with a large number of valid URLs (under 1000)
    urls = [f"https://example.com/file{i}.pdf" for i in range(500)]
    docs = [DocumentUrl(url=url) for url in urls]
    for doc in docs:
        codeflash_output = doc._infer_media_type() # 2.77ms -> 2.75ms (0.777% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
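
Several of the edge cases above rest on how the standard library's `mimetypes.guess_type` treats file suffixes. The sketch below shows three of those behaviors directly. It calls `guess_type` itself rather than `DocumentUrl`, and the URLs are illustrative.

```python
from mimetypes import guess_type

# guess_type returns a (type, encoding) pair. For compressed archives the
# compression suffix is reported as the encoding, not as the type.
tar_type, tar_encoding = guess_type("https://example.com/archive.tar.gz")
print(tar_type, tar_encoding)

# Extension matching falls back to a lowercase lookup.
upper_type, _ = guess_type("https://example.com/FILE.PDF")
print(upper_type)

# An unrecognized extension yields (None, None), which is the case
# _infer_media_type turns into a ValueError.
unknown = guess_type("https://example.com/file.unknownext")
print(unknown)
```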

To edit these changes, run `git checkout codeflash/optimize-pr35-2025-07-25T03.11.24` and push.


KRRT7 and others added 2 commits July 24, 2025 19:48

Update google.py

7dc9ff0

codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Jul 25, 2025

codeflash-ai bot mentioned this pull request Jul 25, 2025

Trigger cf workflow #35

Draft

KRRT7 force-pushed the trigger-cf-workflow branch from eee4872 to dddb328 Compare July 29, 2025 02:33
